Comparison of earthquake-induced shallow landslide susceptibility assessment based on two-category LR and KDE-MLR

Geological hazards caused by strong earthquakes have caused continuous social and economic losses and destruction of the ecological environment in the hazard area, and are mostly manifested in the areas with frequent occurrence of geological hazards or the clustering of geological hazards. Considering the long-term nature of earthquakes and geological disasters in this region, this paper takes ten earthquake-stricken areas in Wenchuan earthquake zone as examples to collect shallow landslide data in 2010, combined with the spatial location of landslides and other factors. Kernel density estimation (KDE) method is used to analyze the spatial characteristics of shallow landslide. Taking the space of shallow landslide as the characteristic variable and fully considering the regulating factors of earthquake-induced landslide: terrain complexity, distance to river, distance to fault, distance to road, lithology, normalized vegetation difference index (NDVI) and ground peak acceleration (PGA) as independent variables, based on KDE and polynomial logistic regression (MLR), A quantitative model of shallow landslide in the earthquake area is constructed. The results show that: (1) PGA has the greatest impact on landslide in the study area. (2) Compared with the two-category logistic regression (two-category LR) model, the susceptibility map of landslide prediction results based on the KDE-MLR landslide susceptibility prediction model is more consistent with the actual situation. (3) The prediction accuracy of the model validation set is 70.7%, indicating that the landslide susceptibility prediction model based on KDE-MLR can effectively highlight the spatial characteristics of shallow landslides in 10 extreme disaster areas. The research results can provide decision-making basis for shallow landslide warning and post-disaster reconstruction in earthquake-stricken areas.

as the spatial probability of landslide occurrence based on a set of geological and environmental conditions. A landslide susceptibility map reflects the spatial distribution of the landslide probability in an area 16 . This analysis can be traced back to the Japanese scholar Saito, who, in the 1960s, used the results of creep tests to predict the locations of shallow landslides. After completing a series of related theoretical systems, many scholars have conducted exploratory research on other quantitative spatial prediction models of regional geological hazards, including empirical models (such as fuzzy logic and generalized additive models) 17,18 , statistical analysis models (such as the certainty coefficient method, weights of evidence and entropy models) [19][20][21] and pattern recognition models (such as artificial neural networks, support vector machines (SVMs) and adaptive, neuro-fuzzy inference systems (ANFISs)) [22][23][24] . Two-category LR is a generalized linear regression analysis model widely used in exploring landslide susceptibility due to the simple calculations and physical clear meaning characteristic. However, this model is relatively simple and cannot handle complex issues in actual situations, some scholars have proposed MLR. The MLR is useful in situations with several dependent variables 25 . Previous findings indicate that the twocategory LR is widely used in shallow landslide susceptibility evaluations 26,27 . The dependent variables of most two-category LR in susceptibility analysis of shallow landslides are binary logic occurrence variables (denoted as 1) and nonoccurrence variables (represented as 0). Notably, the intensity of regional shallow landslides (spatial effects) is not considered in this model resulting in significant forecast biases.
Tobler's first law of geography states that everything is related to other things but near things are more related than distant things; that is, there is a potential dependence between the observed data of certain variables in the same or different distribution areas 28 . Landslides are spatially manifested as regions where geological hazards or groups of hazards frequently occur. In the past, the spatial quantitative modelling of landslides primarily comprised use of statistics to explore the characteristics and development of hazards (whether a hazard occurred or not). Although the spatial distribution of geological hazards appears random, it exhibits inherent regularity, thus spatial autocorrelation is widely used in the study of geological hazards 29,30 . Zou et al. 29 established spatial regression models using the relevant geological hazard data, evaluated the impact of human activities on geological hazards in the Shennongjia Mountain area, and proposed effective prevention and control strategies. Liu et al. 31 explored the spatial-temporal distribution and conditioning factors of geological hazards. Gonzalez et al. 32 investigated the spatial distribution characteristics of seismic hazards in cities and the impact of river vaults on them using the conventional spatial autocorrelation method (SPAC) and Calicatas SPAC method. Previous findings indicate that evaluating the spatial characteristics of landslides improves the accuracy of spatial prediction of landslides.
The purpose of this paper is to build a model to predict shallow landslide susceptibility. The disaster intensity of regional shallow landslides was quantitatively characterized by the model, and an objective understanding of landslide spatial characteristics was used to accurately predict the influence of the landslide disasters. By identifying the shallow landslide susceptibility area, research on shallow landslide spatial characteristics can effectively reduce casualties and property losses. In this paper, ten earthquake-stricken areas affected by the Wenchuan earthquake are studied. The evaluation factors affecting the surface sensibility of shallow landslides are examined from the aspects of topography, geological environment and inducing conditions. KDE is used to describe the multiclassification spatial characteristics of shallow landslides, and the degree of aggregation and dispersion of shallow landslides in space is fully analyzed. A landslide disaster sensitivity prediction model based on KDE-MLR is constructed using MLR. The remainder of the paper is organized as follows: in "Introduction" Section, summarize the research progress of spatial quantitative modeling of landslides. In "Overview of the study area" Section, introduce the 10 severely stricken areas of the Wenchuan earthquake, and the data sources. In "Method" Section, introduce the research methods used in this paper, including: KDE, LR algorithm (two-category LR, MLR), Model performance evaluation index. In "Results" Section, construct the landslide susceptibility prediction model based on KDE-MLR, analyze the influence degree of landslide conditioning factors, evaluate the model and discuss the validity of the model. In "Discussion" Section, conclusions are presented. In "Conclusions" Section, discussion is presented.

Overview of the study area
This paper conducts a spatial prediction of landslides in the 10 extremely earthquake-stricken areas of the Wenchuan earthquake and improves the two types of dependent variables over those used in traditional landslide evaluations (1 for landslide occurrences and 0 for landslide non-occurrences). First, the spatial characteristics of landslides are analyzed by the KDE method. Second, by taking the spatial characteristics of landslides as the dependent variable, using the advantages of the MLR method, and expanding it into three categories, then construct a landslide susceptibility prediction model based on KDE-MLR. Finally, a spatial landslide susceptibility map is generated. The research framework, which is shown in Fig. 1 is divided into five steps: Step 1 Collect landslide geological hazard data in the study area, including the spatial location of landslide points, the number of landslides, and related data on seven conditioning factors.
Step 2 Account for factors such as the spatial locations of the landslides, KDE is used to describe the spatial characteristics of the landslide in the study area.
Step 3 Use the MLR and take the spatial characteristics of landslides as dependent variables and the conditioning factors as independent variables, then construct a landslide susceptibility prediction model based on KDE-MLR.
Step 4 Randomly select 80% of the spatial characteristics of landslides in the study area as training samples and 20% as verification samples, then continuously adjust the model parameters and evaluate the effectiveness and performance of the model. Study area. In 2008, the 5.12 Wenchuan earthquake shook Sichuan, Gansu and Shaanxi provinces. This paper studies the 10 extremely earthquake-stricken areas of the Wenchuan earthquake that were announced by the Ministry of Land and Resources of China. These 10 extremely earthquake-stricken areas suffered the most damage during the earthquake and are all located in Sichuan Province: Wenchuan, Beichuan, Qingchuan, Maoxian, Anxian, Mianzhu, Shifang, Dujiangyan, Pingwu and Pengzhou (Fig. 2). The study area covers approximately 26,000 km 2 and has a total population of over 3.5 million. The topography of the region is relatively complex, with elevations ranging between 490 and 5600 m, and the average annual precipitation is between 490 and 1400 mm 33 . In addition, the terrain in this area is complex, located in the Longmenshan fault zone (one of the fault zones with the largest topographic gradient in the world), and there have been many strong earthquakes. By investigating the occurrence of geological hazards (e.g., landslides) in the extremely earthquake-stricken areas   www.nature.com/scientificreports/ of the Wenchuan earthquake, landslides can be found to have increased sharply 2-3 years after the earthquake 34 . Therefore, this paper takes 2010 as an example to analyze the spatial distribution characteristics of landslides in the 10 extremely earthquake-stricken areas and establishes a quantitative spatial landslide model to provide a reference for hazard monitoring and decision-making to formulate emergency plans.
Data sources. Data preparation. In this study, the landslide that occurred in 2010 (2 years after the Wenchuan earthquake) was selected as the research object due to the long-term nature of the earthquake and geological hazards. The spatial location and number of shallow landslides in the study area of this paper were obtained by Landsat-8 30 m resolution remote sensing data interpretation, and grid processing was carried out with 60 m*60 m pixels in ArcGIS 10.5 software. According to the results of remote sensing analysis and statistics, there were 885 landslide points in the study area in 2010. However, in real life, the occurrence of shallow landslides often changes the whole area, which may make the model ignore the change in nonlandslide points in the calculation process. This reduces the accuracy of the landslide susceptibility evaluation map due to inconsistency with the actual situation. A total of 885 non-landslide points were randomly selected in a 1:1 ratio outside the buffer zone, 10 km away from the landslide point to reduce errors caused by this situation. The spatial distribution of the 885 non-landslide points is shown in Fig. 3.
Conditioning factors of landslides. The selection of conditioning factors is one of the most important tasks in spatial predictive modelling 35 . Although there is currently no clear guideline for selecting conditioning factors when constructing prediction models, the selection of conditioning factors usually depends on the characteristics of the study area and the scale of the hazard 36,37 . For this paper, we selected seven landslide conditioning factors or independent variables based on the quality and availability of data: terrain information entropy (Ht) ( ). Among them, Ht is a topographic and geomorphic factor, distance to rivers, distance to faults, lithology, and NDVI are geological environment factors, and distance to roads and PGA are inducing factors. All conditioning factors were generated using ArcGIS software. The topographic information entropy was calculated using 1:35 0000 DEM data 30 . The distance to the fault was derived from the 1:500,000 regional geological map 38 . The distance to the river was obtained from 1:3 million National Hydrogeological Atlas (https:// www. osgeo. cn/ map/ m04dd). Statistical yearbook and historical data were obtained through remote sensing interpretation. Lithology data were retrieved from 1:500,000 regional geological map 38,39 . NDVI was obtained from Landsat-7 30-m resolution remote sensing interpretation. PGA data were retrieved from the United States

Method
Kernel density estimation (KDE). In order to deeply reveal the spatial distribution characteristics of earthquake and landslide geological hazards, this paper selects the KDE to quantitatively calculate the spatial distribution characteristics of landslides, and analyzes the spatial continuous trend of landslides in the hardest hit area of the Wenchuan earthquake. KDE is mainly used to estimate the density of point or line features around each output raster cell 40 . Through the two-dimensional greyscale expression or three-dimensional surface expression of the earthquake-induced landslide core density calculation results, the distribution characteristics of the cluster or dispersion of the landslide point groups can be simply and intuitively obtained. The two-dimensional kernel density is an extension of the one-dimensional kernel density, and the density value at a certain point is estimated by calculating the kernel estimator: In formula (1), s is the spatial location of an estimated point, n is the number of landslide points within the bandwidth, d is is the distance from the i th landslide point to s , and h is the bandwidth. The bandwidth formula is as follows: In formulas (2) and (3), D m is the median distance from each landslide point to the average centre, N is the sum of the landslide points, SD is the standard distance, M is the total number of landslide points, x i and y i are the coordinates of the landslide points i and X and Y are the average center of the landslide points. K 0 is the kernel function, and its formula is shown:   44,45 . The two-category LR method does not require the independent variables to conform to the normal distribution and it has no restriction on the distribution of the identified variables. It can be used to predict the probability of dependent variables with binomial characteristics. When the dependent variable in the two-category LR model is of two types, namely, occurrence of a landslide (denoted as 1) and nonoccurrence of a landslide (denoted as 0), the relationship between the probability of landslide occurrence and the conditioning factor is shown in the following formula 46 : In formulas (5) and (6), is the two-category LR prediction value representing the probability of landslide occurrence and having a value in the range [0, 1], Z represents the sum of the linear weight values after the variables are superimposed, is the constant term of the logistic regression, xi represents each conditioning factor and is the two-category LR coefficient of the ith conditioning factor.

MLR.
In practical problems, the dependent variable may have multiple values. When there are more than two dependent variable values, MLR analyses need to be used. MLR analysis is a variant of two-category LR and its basic concept is essentially the same as that of two-category LR 47,48 . MLR analysis assumes that dependent variable categories are completely independent. It also takes one of the categories as a reference group and calculates a reference group regression coefficient 49 .
The MLR model can be considered a J-1 two-category LR model of J dependent variables and each independent dependent variable is compared with the reference group to calculate the advantage of the reference group 50 . Assuming that Y = J is the reference group, the LR model of each category is as follows: In formula (7), j = 1, . . . J − 1 , αj is the intercept, xi represents each conditioning factor, and βij is the MLR coefficient of the No. i th conditioning factor. The MLR model is expressed in the form of probability as follows: In formulas (8) and (9), Pj represents the probability that the dependent variable is a certain category and PJ represents the probability of the reference group. The values of the dependent variable Y in this study are 0, 1 and 2. Since most of the locations in the 10 extremely earthquake-stricken areas are non-landslide points, the dependent variable Y = 0 is set as the reference group.
Model performance evaluation index. To evaluate the performance of both the two-category LR and KDE-MLR landslide susceptibility prediction models, this paper uses an error matrix to evaluate the accuracy of each model 51 , as shown in Table 1. The overall accuracy is principally used to determine the performance of the models and is obtained by comparing the predicted model category with the actual category.
In MLR analysis, training sets are used to train the model. When predicting the remaining test datasets, the probability of each category is compared by calculating the predicted dependent variable probabilities for  50 . The effectiveness of the model is determined by its overall accuracy rate, which expresses the accuracy of the model. The accuracy rate is the ratio of the number of samples that predict the correct classification of the model to the total number of samples. The specific formula is as follows:

Results
Calculation results of the spatial characteristics of landslides. To further understand the spatial distribution characteristics of landslides, this paper comprehensively considers the spatial location and number of landslides. This paper also uses the KDE formula in Sect. Kernel density estimation (KDE) to calculate the spatial characteristics of landslides of the 10 extremely earthquake-stricken areas, as shown in Fig. 5. Based on the value of spatial distribution characteristics that calculated by KDE, this paper uses the average value (3.3) as the demarcation point to divide the spatial characteristics of landslides into two levels (High, Medium). Thus, the spatial characteristics of landslides of all landslide points is divided into 2 levels, High and Medium. In Fig. 5, approximately 62% of the landslide points have high spatial aggregation of landslides. These points mainly lie in Wenchuan and Beichuan. The KDE value of 337 landslide disaster points in the entire study area is lower than 3.3, the landslide spatial characteristics is moderate, and the distribution in other counties (cities) is relatively uniform.

Analysis of the influence degree of factors on landslides. Analysis of landslide conditioning factors
based on LR. In the two-category LR landslide prediction model, this paper randomly selects 80% of the 885 landslide and non-landslide points (619 landslide points and 619 non-landslide points) as the training set and 20% (266 landslide points and 266 non-landslide points) as the validation set. Two-category LR is performed based on the extracted landslide conditioning factors and training samples, and the regression coefficient of each conditioning factor is obtained. As shown in Table 2, the greater the absolute value of the coefficient, the greater the impact it has on landslide susceptibility. Table 2 shows that the absolute value of the PGA regression coefficient is the highest (0.418), which indicates that this factor has the largest influence on earthquake-induced landslides. The lithology, distance to rivers and Ht also have relatively large impacts on landslides in the study area, with absolute values of their regression coefficients between 0.2 and 0.4. The regression coefficients of the distance to roads, NDVI and distance to faults are all less than 0.1, indicating that these three conditioning factors have relatively little correlation with landslides.   52 . MLR uses different criteria in determination of the MLR coefficients of discrete variables and continuous variables. Therefore, continuous variables were classified in this study and treated as separate variables ( Table 3).
The results of the landslide susceptibility prediction model based on KDE-MLR were calculated by SPSS Modeler software. The same sample ratio used in the two-category LR landslide prediction model was used to divide the dataset. In this study, 80% of the dataset was used as the training set (0: 708 individuals in this set, 1: 269 individuals in this set, and 2: 439 individuals in this set) and 20% was used as the validation set (0: 177 individuals in this set, 1: 67 individuals in this set, and 2:110 individuals in this set) (0 represents no landslide point). The dataset comprised 3 categories, with Y = 0 as the reference group. Two groups of regression coefficients with different spatial characteristics of landslides levels were obtained. The results are shown in Fig. 6: (1) When the dependent variable is the moderate spatial characteristics of landslides (Y = 1), lithology and NDVI have the highest effect on the region, and the two variables were positively correlated. The result indicated that the probability of Y = 1 was higher in areas with softer lithology and lower NDVI values compared with the reference group. This implies that areas with less vegetation cover and softer strata were prone to moderate spatial aggregation of landslides. The regression coefficients of each level of the distance to the road, the distance to the fault zone, the distance to the river, PGA, and Ht were all negative values. This finding implies that the intermediate level of spatial characteristics of landslides was negatively correlated with these factors. These results indicate that shorter distance to the road, shorter the distance to the fault zone, and shorter distance to the river are associated with moderated spatial characteristics of landslides are more moderate. In addition, the regression coefficients of PGA and Ht fluctuate according to the classification. (2) When the dependent variable represents the high spatial characteristics of landslides (Y = 2), the regression coefficients of NDVI and lithology are basically consistent with the case of Y = 1. In this case, the high spatial characteristics of landslides exhibited a high dependence on lithology and NDVI, and the spatial characteristics of landslides were positively correlated with lithology and NDVI. The regression coefficient value gradually decreased with increase in the distance to faults. However, all coefficients had positive values, indicating that landslide susceptibility was higher in areas closer to the faults and the influence of this factor on Y = 2 was higher than its influence on Y = 0. The regression coefficients of Ht had negative values, implying a larger value of Ht was associated with a higher correlation of spatial characteristics of landslides. The regression coefficient value of PGA was small, but the change was relatively stable. The regression coefficients of the distance to the road and the distance to the river exhibited distinct correlations according to the classification. The absolute value of the regression coefficient was largest when the distance to the road was 1-2 km and the distance to the river was less than 1 km. The impact on the landslide disaster was largest under these conditions.
The importance of each conditioning factor is obtained by combining the two sets of regression coefficients. From Table 4, it can be concluded that the influence on landslide geological hazards from high to low is PGA, NDVI, lithology, distance to roads, distance to rivers, distance to faults, and Ht.  Table 3. Classification of Influential Factors.    Table 1 is used to calculate the overall accuracy of the KDE-MLR landslide susceptibility prediction model. The results in Table 5 show that the overall accuracy of the training datasets used to build the model is 71%, the overall accuracy of the validation datasets used to verify the model is 70.7%, and the overall accuracy is more than 70%. This demonstrates that the landslide susceptibility prediction model based on KDE-MLR credibly predicts the probability distribution of the spatial aggregation of landslides in the study area 57 . Similarly, the two-category LR landslide prediction model was verified. At the same time, the two-category LR landslide prediction model is verified, and its overall accuracy rate is about 72%, indicating that the two-category LR landslide disaster prediction model is also feasible, because the two models have similar accuracy rates. Therefore, the problem of error caused by model accuracy was excluded in the follow-up study.
Comparative analysis of landslide susceptibility. This paper assigns a corresponding weight value to each conditioning factor based on the regression coefficient. Then, by superimposing each conditioning factor layer, distribution maps of landslide predictions for both the two-category LR-based landslide prediction model (Fig. 7a) and the KDE-MLR model (Fig. 7b) are obtained. In Table 6 and Fig. 7: (1) In the predicted results of the two models above, more than 50% of landslides occur in areas with high landslide susceptibility. In the KDE-MLR model, the area with a landslide susceptibility level of 3 is approximately 5441 km 2 . In the three-level landslide susceptibility area, divided by the two-category LR model according to the natural segment point method, the area of the high-level susceptibility is approximately 7104 km 2 , and the landslide density is significantly lower than the predicted result in the KDE-MLR model. Overall, the landslide susceptibility prediction model based on KDE-MLR constructed in this paper can effectively highlight the spatial characteristics of landslides in 10 extremely earthquake-stricken study areas. (2) The landslide susceptibility prediction model based on KDE-MLR is used to predict the hazard level areas with high spatial characteristics of landslides. According to Fig. 7, it can be seen that most of them are distributed in Beichuan and Wenchuan areas. On the whole, with the increase of susceptibility level, the regional landslide density has also gradually increased. Generally, as the susceptibility level increases, the  www.nature.com/scientificreports/ density of regional landslides also increases. From the statistical analysis shown in Fig. 8, there are a total of 552 landslides in Wenchuan and Beichuan, and approximately 71% of those landslides are in areas with a predicted landslide susceptibility levels of 2. In Beichuan, approximately 80% of the landslides are located in the Y = 2 area, which is more consistent with the spatial distribution of the actual landslide points.

Discussion
This paper takes 10 extremely earthquake-stricken areas of Wenchuan as examples, constructs a spatial quantitative modelling of landslide of earthquake-stricken area which based on the KDE-MLR. This provides a decisionmaking basis for the early warning of landslide in earthquake-stricken areas. But it can also be improved from the following aspects: www.nature.com/scientificreports/ (1) The nonlandslide points in this paper are selected at a certain distance away from the disaster points. According to Zuo and Wang 58 , that the positive training set usually uses known data. However, not all the negative points can be treated as true negative samples, which means that selecting of random negative training points creates uncertainty. For the spatial prediction model of landslides that is driven by supervisory data, identifying negative samples will be the focus of research in the future. (2) When predicting landslides, this paper uses the KDE algorithm to describe the spatial characteristics of landslides. Kristan et al. 59 proposed a new online estimation method of the probability density function based on kernel density estimation (KDE). This method maintains and updates the nonparametric model of the observed data, and has great advantages over the selection of bandwidth and kernel function in this paper. Additionally, the advantages of machine learning methods in the spatial prediction of landslides are constantly being shown. Therefore, in future research, this paper will consider to using machine learning methods for prediction.

Conclusions
Taking as examples the landslides in the 10 areas extremely earthquake-stricken areas of the Wenchuan earthquake in 2010, this paper comprehensively considers the spatial location of landslides and other factors after the earthquake and uses the KDE method to quantitatively characterize the spatial characteristics of landslides and improve the intensity description of traditional landslides (the "0-1" occurrence assignment method). Next, the spatial distribution characteristics of landslides after the earthquake are calculated. Then, taking the spatial characteristic of landslides as the dependent variable and each conditioning factor as the independent variable, an MLR model is used to establish a quantitative spatial landslide model. After comparing and analyzing the traditional two-category LR landslides prediction model, the specific conclusions are as follows: (1) In the landslide susceptibility prediction model based on KDE-MLR, the absolute value of the logistic regression coefficient of PGA is the largest (PGA = 0.418), indicating that this conditioning factor has the greatest impact on the landslides in the study area. The value of PGA = 0.42 is basically consistent with these results. NDVI is also has a high value in the KDE-MLR model, however, in the two-category LR model, this conditioning factor has the smallest impact on the prediction results. The predictive variable terrain information entropy is the least important in the KDE-MLR model. The four conditioning factors of lithology, distance to roads, distance to rivers and distance to faults are basically consistent between the two prediction models. (2) The accuracy of the KDE-MLR model when predicting the spatial distribution of landslides in the study area is approximately 70.7%, which is similar to the prediction accuracy of the traditional two-category LR landslide model. However, the landslide prediction susceptibility maps show that the KDE-MLR model susceptibility predictions are more consistent with the spatial distribution of actual landslides. Compared to the landslide prediction results obtained from the two-category LR model, the KDE-MLR model highlights the spatial characteristics of landslides in the 10 extremely earthquake-stricken areas and more accurately expresses the spatial distribution characteristics of landslides. This shows that on the basis of combining GIS technology, the KDE-based, MLR method has great advantages for studying the spatial prediction of earthquake-induced landslides over the traditional, two-category LR method. Furthermore, it has great prospects for promotion and application.

Data availability
The datasets used and analysed during the current study available from the corresponding author on reasonable request.